{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Metrics for evaluating machine learning models\n",
    "This notebook explores different metrics that can be used for evaluating various machine learning models. Here we use data obtained from functions provided by sklearn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import cuml\n",
    "import sklearn\n",
    "\n",
    "from cuml.cluster import KMeans\n",
    "from cuml.ensemble import RandomForestRegressor as curfr\n",
    "from cuml.ensemble import RandomForestClassifier as curfc\n",
    "from cuml.manifold.umap import UMAP as cuUMAP\n",
    "\n",
    "from sklearn import manifold\n",
    "from sklearn.datasets import make_classification, make_regression\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.datasets.samples_generator import make_blobs\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_samples = 2**10\n",
    "n_features = 100\n",
    "n_info = 70"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating the required datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "centers = round(n_samples*0.4)\n",
    "X_blobs, y_blobs = make_blobs(n_samples=n_samples, centers=centers,\n",
    "                              n_features=n_features)\n",
    "X_class, y_class = make_classification(n_samples=n_samples, n_features=n_features,\n",
    "                                       n_clusters_per_class=1, n_informative=n_info,\n",
    "                                       random_state=123, n_classes=2)\n",
    "X_class = X_class.astype(np.float32)\n",
    "y_class = y_class.astype(np.int32)\n",
    "X_class_train, X_class_test, y_class_train, y_class_test = train_test_split(X_class,\n",
    "                                                                            y_class, train_size=0.8,\n",
    "                                                                            random_state=10)\n",
    "X_reg, y_reg = make_regression(n_samples=n_samples, n_features=n_features,\n",
    "                               n_informative=n_info, random_state=123)\n",
    "X_reg = X_reg.astype(np.float32)\n",
    "y_reg = y_reg.astype(np.float32)\n",
    "X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg,\n",
    "                                                                    y_reg, train_size=0.8,\n",
    "                                                                    random_state=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Trustworthiness\n",
    "It is a measure of the extent to which the local structure is retained in the embedding of the model. Therefore, if a sample predicted by the model lied within the unexpected region of the nearest neighbors, then those samples would be penalized. For more information on the trustworthiness metric please refer to: https://scikit-learn.org/dev/modules/generated/sklearn.manifold.t_sne.trustworthiness.html\n",
    "\n",
    "the documentation for cuML's implementation of the trustworthiness metric is: https://rapidsai.github.io/projects/cuml/en/stable/api.html#cuml.metrics.trustworthiness.trustworthiness\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_embedded = cuUMAP(n_neighbors=10).fit_transform(X_blobs)\n",
    "X_embedded = X_embedded.astype(np.float32)\n",
    "\n",
    "cu_score = cuml.metrics.trustworthiness(X_blobs, X_embedded)\n",
    "sk_score = sklearn.manifold.trustworthiness(X_blobs, X_embedded)\n",
    "\n",
    "print(\" cuml's trustworthiness score : \", cu_score)\n",
    "print(\" sklearn's trustworthiness score : \", sk_score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adjusted random index\n",
    "This is a metrics which is used to measure the similarity between two data clusters, and it is adjusted to take into consideration the chance grouping of elements.\n",
    "For more information on Adjusted random index please refer to: https://en.wikipedia.org/wiki/Rand_index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cuml_kmeans = KMeans(n_clusters=10)\n",
    "X_blobs = StandardScaler().fit_transform(X_blobs)\n",
    "cu_y_pred = cuml_kmeans.fit_predict(X_blobs).to_array()\n",
    "\n",
    "cu_adjusted_rand_index = cuml.metrics.cluster.adjusted_rand_score(y_blobs, cu_y_pred)\n",
    "sk_adjusted_rand_index = sklearn.metrics.cluster.adjusted_rand_score(y_blobs, cu_y_pred)\n",
    "\n",
    "print(\" cuml's adjusted random index score : \", cu_adjusted_rand_index)\n",
    "print(\" sklearn's adjusted random index score : \", sk_adjusted_rand_index)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## R^2 score\n",
    "R^2 score is also known as the coefficient of determination. It is used as a metric for scoring regression models. It scores the output of the model based on the proportion of total variation of the model.\n",
    "For more information on the R^2 score metrics please refer to: https://en.wikipedia.org/wiki/Coefficient_of_determination\n",
    "\n",
    "For more information on cuML's implementation of the r2 score metrics please refer to : https://rapidsai.github.io/projects/cuml/en/stable/api.html#cuml.metrics.regression.r2_score\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cuml_reg_model = curfr(max_features=1.0, n_bins=8, max_depth=10,\n",
    "                       split_algo=0, min_rows_per_node=2,\n",
    "                       n_estimators=30)\n",
    "cuml_reg_model.fit(X_reg_train,y_reg_train)\n",
    "cu_preds = cuml_reg_model.predict(X_reg_test,y_reg_test)\n",
    "\n",
    "cu_r2 = cuml.metrics.r2_score(y_reg_test, cu_preds)\n",
    "sk_r2 = sklearn.metrics.r2_score(y_reg_test, cu_preds)\n",
    "\n",
    "print(\"cuml's r2 score : \", cu_r2)\n",
    "print(\"sklearn's r2 score : \", sk_r2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Accuracy score\n",
    "Accuracy score is the ratio of correct predictions to the total number of predictions. It is used to measure the performance of classification models. \n",
    "For more information on the accuracy score metric please refer to: https://en.wikipedia.org/wiki/Accuracy_and_precision\n",
    "\n",
    "For more information on cuML's implementation of accuracy score metrics please refer to: https://rapidsai.github.io/projects/cuml/en/0.10.0/api.html#cuml.metrics.accuracy.accuracy_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cuml_class_model = curfc(max_features=1.0, n_bins=8, max_depth=10,\n",
    "                   split_algo=0, min_rows_per_node=2,\n",
    "                   n_estimators=30)\n",
    "cuml_class_model.fit(X_class_train,y_class_train)\n",
    "cu_preds = cuml_class_model.predict(X_class_test,y_class_test)\n",
    "\n",
    "cu_accuracy = cuml.metrics.accuracy_score(y_class_test, cu_preds)\n",
    "sk_accuracy = sklearn.metrics.accuracy_score(y_class_test, cu_preds)\n",
    "\n",
    "print(\"cuml's accuracy score : \", cu_accuracy)\n",
    "print(\"sklearn's accuracy score : \", sk_accuracy)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
